# Model Competition and DPO Training Pipeline

A pipeline for running model competitions and DPO (Direct Preference Optimization) training with multiple language models.

## Overview

This project implements a pipeline that:
1. Runs competitions between different language models
2. Performs DPO training on the models in parallel
3. Supports multiple tasks and evaluation metrics
4. Uses a static or dynamic rating system

## Requirements

- CUDA-compatible GPUs
- Conda environment with Python 3.11
- Required packages can be installed using the provided `environment.yml`


## Configuration

Key configuration parameters in `run_pipeline.sh`:

### General Configuration
- `BASE_MODEL`: Base model type ("qwen" or "gemma")
- `MAX_ITERATIONS`: Number of training iterations
- `GPU_IDS`: GPUs to use (comma-separated)
- `BATCH_SIZE`: Batch size for training
- `MODEL_NAMES`: Models to include in competition

### Competition Configuration
- `TASK`: Task name (gsm8k, medqa, culture_country, etc.)
- `SCORE_TYPE`: Rating system type (static/dynamic)
- `USE_FAIR`: Whether to use fair competition
- `NUM_OPPONENTS`: Number of opponents per round
- `RANDOM_MATCH_PROB`: Probability of random opponent selection
- `DIFFICULTY`: Difficulty of the MATH task (easy/medium/hard)
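
How `NUM_OPPONENTS` and `RANDOM_MATCH_PROB` might interact can be sketched as below. This is an illustration, not the pipeline's actual matching code: the helper name `pick_opponents`, the `ratings` dictionary, and the rule that non-random picks go to the closest-rated remaining model are all assumptions.

```python
import random


def pick_opponents(model, ratings, num_opponents, random_match_prob, rng=None):
    """Pick up to `num_opponents` opponents for `model` (hypothetical helper).

    Each slot is filled uniformly at random with probability
    `random_match_prob`; otherwise the remaining candidate whose rating is
    closest to `model`'s rating is chosen (an assumed matching rule).
    """
    rng = rng or random.Random()
    candidates = [name for name in ratings if name != model]
    chosen = []
    while candidates and len(chosen) < num_opponents:
        if rng.random() < random_match_prob:
            pick = rng.choice(candidates)
        else:
            pick = min(candidates, key=lambda n: abs(ratings[n] - ratings[model]))
        chosen.append(pick)
        candidates.remove(pick)  # never face the same opponent twice per round
    return chosen
```

With `random_match_prob=0.0` the selection is fully rating-based, and with `1.0` it is fully random; values in between mix the two.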

### DPO Configuration
- `NUM_EPOCHS`: Number of training epochs
- `LR`: Learning rate
- `MSEQLEN`: Maximum sequence length
- `QLORA`: Whether to use QLoRA training
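
For readers new to DPO, the standard per-example objective can be sketched in plain Python. The function name and the `beta` temperature are illustrative and are not among the parameters listed above; the pipeline's own training code may well rely on a library implementation instead.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the trained policy and the frozen reference model.
    `beta` is an assumed default, not a documented pipeline setting.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals `log(2)` when the policy and reference agree, and falls as the policy assigns relatively more probability to the chosen response than the reference does.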

## Usage

1. Set up the environment:
```bash
# Create and activate conda environment
conda env create -f environment.yml -n your_env_name
conda activate your_env_name
```
2. Prepare your models:
   - Place base models in `base_model/`
   - Initialized models are stored in `init_model/`

3. Run the pipeline:
```bash
bash run_pipeline.sh
```

## Supported Tasks

- `gsm8k`: Mathematical reasoning
- `medqa`: Medical knowledge QA
- `culture_country`: Cultural knowledge about countries
- `culture_value`: Cultural values assessment
- `culture_rule_of_thumb`: Common sense rules
- `alpaca`: General instruction following
- `truthfulqa`: Truthfulness assessment
- `knowledge_crossword`: Knowledge-based crossword puzzles
- `MATH`: Mathematical reasoning at a selectable difficulty (see `DIFFICULTY`)

## Output

The pipeline creates an experiment directory under `dpo_model/` with:
- Model checkpoints for each iteration
- Competition results
- Evaluation metrics
- Parameter configurations

## Notes

- The pipeline manages GPU memory and trains models in parallel across the GPUs listed in `GPU_IDS`
- Models are trained in groups to optimize GPU utilization
- Competition results are saved after each iteration
- The pipeline automatically frees GPU memory between iterations
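
Training models in groups could look roughly like the sketch below. It assumes one model per GPU within each group and a comma-separated `GPU_IDS` string; the function name `plan_training_groups` is hypothetical and the pipeline's real scheduling may differ.

```python
def plan_training_groups(model_names, gpu_ids):
    """Split models into groups sized to the available GPUs (illustrative).

    Returns a list of groups; each group is a list of (model_name, gpu_id)
    pairs that could be trained at the same time.
    """
    gpus = [g.strip() for g in gpu_ids.split(",") if g.strip()]
    if not gpus:
        raise ValueError("gpu_ids must name at least one GPU")
    groups = []
    for start in range(0, len(model_names), len(gpus)):
        batch = model_names[start:start + len(gpus)]
        # zip stops at the shorter sequence, so a final partial group is fine
        groups.append(list(zip(batch, gpus)))
    return groups
```

For example, three models on `GPU_IDS="0,1"` would be planned as one group of two followed by one group of one.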

## License

MIT License
